Abstract

There is a rapidly increasing interest in crowdsourcing for data labeling. By crowdsourcing,
a large number of labels can often be gathered quickly at low cost. However, the labels
provided by the crowdsourcing workers are usually not of high quality. In this paper, we propose
a minimax conditional entropy principle to infer ground truth from noisy crowdsourced labels. 
Under this principle, we derive a unique probabilistic labeling model jointly parameterized by 
worker ability and item difficulty. We also propose an objective measurement principle, and 
show that our method is the only method which satisfies this objective measurement principle. 
We validate our method through a variety of real crowdsourcing datasets with binary, multiclass 
or ordinal labels. 
1 Introduction


In many real-world applications, the quality of a machine learning system is governed by the number
of labeled training examples, but the labor for data labeling is usually costly. There has been
considerable machine learning research on learning when there are only a few labeled examples,
such as semi-supervised learning and active learning. In recent years, with the emergence of
crowdsourcing (or human computation) services like Amazon Mechanical Turk^1, the costs associated with
collecting labeled data in many domains have dropped dramatically, enabling the collection of large
amounts of labeled data at a low cost. However, the labels provided by the workers are often not
of high quality, in part due to misaligned incentives and a lack of domain expertise among the workers.
To overcome this quality issue, the items are generally labeled redundantly by several different
workers, and the workers' labels are then aggregated in some manner, for example, by majority voting.

The assumption underlying majority voting is that all workers are equally good so they have 
equal vote. Obviously, such an assumption does not reflect the truth. It is easy to imagine that 
one worker is more capable than another in some labeling task. More subtly, the skill level of 
a worker may vary significantly from one labeling category to another. To address these issues,
Dawid and Skene (1979) propose a model which assumes that each worker has a latent probabilistic
confusion matrix for generating her labels. The off-diagonal elements of the matrix represent the 
probabilities that the worker mislabels an item from one class as another while the diagonal elements 
correspond to her accuracy in each class. The true labels of the items and the confusion matrices 
of the workers can be jointly estimated by maximizing the likelihood of the workers’ labels. 

In the Dawid-Skene method, the performance of a worker characterized by her confusion matrix 
stays the same across all items in the same class. That is not true in many labeling tasks, where 
some items are more difficult to label than others, and a worker is more likely to mislabel a difficult 
item than an easy one. Moreover, an item may be easily mislabeled as some class rather than 
others by whoever labels it. To address these issues, we develop a minimax conditional entropy 
principle for crowdsourcing. Under this principle, we derive a unique probabilistic model which 
takes both worker ability and item difficulty into account. When item difficulty is ignored, our
model seamlessly reduces to the classical Dawid-Skene model. We also propose a natural objective 
measurement principle, and show that our method is the only method which satisfies this objective 
measurement principle. 


This work is an extension of the earlier results presented in (Zhou et al., 2012, 2014). We
organize the paper as follows. In Section 2, we propose the minimax conditional entropy principle
for aggregating multiclass labels collected from a crowd and derive its dual form. In Section 3,
we develop regularized minimax conditional entropy for preventing overfitting and generating
probabilistic labels. In Section 4, we propose the objective measurement principle, which also leads
to the probabilistic model derived from the minimax conditional entropy principle. In Section 5, we
extend our minimax conditional entropy method to ordinal labels, where we need to introduce a new
assumption called adjacency confusability. In Section 6, we present a simple yet efficient coordinate
ascent method to solve the minimax program through its dual form and also a method for model
selection. Related work is discussed in Section 7. Empirical results on real crowdsourcing data
with binary, multiclass or ordinal labels are reported in Section 8, and conclusions are presented in
Section 9.


^1 https://www.mturk.com
Figure 1: Left table: observed labels $x_{ij}$ provided by worker $i$ for item $j$. Right table: underlying
distributions $\pi_{ij}$ of worker $i$ for generating a label for item $j$. In our approach, the rows and columns
of the unobserved right table are constrained to match the rows and columns of the observed
left table.

2 Minimax Conditional Entropy Principle 

In this section, we present the minimax conditional entropy principle for aggregating crowdsourced 
multiclass labels in both its primal and dual forms. We also show that minimax conditional entropy 
is equivalent to minimizing Kullback-Leibler (KL) divergence. 

2.1 Notation and Problem Setting 

Assume that there is a group of workers indexed by $i$, a set of items indexed by $j$, and a number
of classes indexed by $k$ or $c$. Let $x_{ij}$ be the observed label that worker $i$ assigns to item $j$, and $X_{ij}$
be the corresponding random variable. Denote by $Q(Y_j = c)$ the unobserved true probability that
item $j$ belongs to class $c$. A special case is that $Q(Y_j = c) = 1$ and $Q(Y_j = k) = 0$ for any other
class $k \neq c$. That is, the labels are deterministic. Denote by $P(X_{ij} = k \mid Y_j = c)$ the probability that
worker $i$ labels item $j$ as class $k$ while the true label is $c$. Our goal is to estimate the unobserved
true labels from the noisy workers' labels.

2.2 Primal Form 

Our approach is built upon two four-dimensional tensors with the four dimensions corresponding 
to workers i, items j, observed labels k, and true labels c. The first tensor is referred to as the 
empirical confusion tensor of which each element is given by 

$$\widehat{\phi}_{ij}(c, k) = Q(Y_j = c)\, \mathbb{I}(x_{ij} = k)$$

to represent an observed confusion from class $c$ to class $k$ by worker $i$ on item $j$. The other tensor
is referred to as the expected confusion tensor of which each element is given by

$$\phi_{ij}(c, k) = Q(Y_j = c)\, P(X_{ij} = k \mid Y_j = c)$$

to represent an expected confusion from class c to class k by worker i on item j. 
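As a minimal sketch of the empirical confusion tensor, the following NumPy snippet builds an array indexed as `[i, j, c, k]` from a label matrix and a label distribution, then sums over items to get per-worker confusion counts. The function name `empirical_confusion`, the use of `-1` for a missing label, and the toy labels are all illustrative assumptions, not part of the paper.

```python
import numpy as np

def empirical_confusion(labels, Q):
    """Empirical confusion tensor phi_hat[i, j, c, k] = Q(Y_j = c) * I(x_ij = k).

    labels: (workers, items) integer array; -1 marks a missing label (assumption).
    Q:      (items, classes) array of true-label probabilities.
    """
    n_workers, n_items = labels.shape
    n_classes = Q.shape[1]
    # one_hot[i, j, k] = I(x_ij = k); rows stay all-zero where no label was given
    one_hot = np.zeros((n_workers, n_items, n_classes))
    i_idx, j_idx = np.nonzero(labels >= 0)
    one_hot[i_idx, j_idx, labels[i_idx, j_idx]] = 1.0
    # broadcast Q over workers and observed labels over true classes
    return Q[None, :, :, None] * one_hot[:, :, None, :]

# Hypothetical toy data: 3 workers, 6 items, deterministic true labels.
labels = np.array([[0, 0, 1, 1, 2, 2],
                   [0, 1, 1, 1, 2, 0],
                   [0, 0, 1, 2, 2, 2]])
truth = np.array([0, 0, 1, 1, 2, 2])
Q = np.eye(3)[truth]

phi_hat = empirical_confusion(labels, Q)
worker_counts = phi_hat.sum(axis=1)   # shape (workers, classes, classes)
```

With deterministic `Q`, `worker_counts[i][c, k]` is the number of items of true class `c` that worker `i` labeled as `k`.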

We assume that the labels of the items are independent. Thus, the entropy of the observed 
workers’ labels conditioned on the true labels can be written as 

$$H(X \mid Y) = -\sum_{j,c} Q(Y_j = c) \sum_{i,k} P(X_{ij} = k \mid Y_j = c) \log P(X_{ij} = k \mid Y_j = c).$$
Figure 2: An illustration of the empirical confusion tensors. The table contains three workers'
labels over six items. These items are assumed to have deterministic true labels as follows: class 1
= {item 1, item 2}, class 2 = {item 3, item 4}, and class 3 = {item 5, item 6}. The $(c, k)$-th entry
of matrix $\widehat{\phi}_i$ represents the number of the items labeled as class $k$ by worker $i$ given that their true
labels are class $c$.


Both the distributions P and Q are unknown here. To attack this problem, we first consider a 
simpler problem: estimate P when Q is given. Then, we proceed to jointly estimating P and Q 
when both are unknown. 

Given the true label distribution Q, we propose to estimate P which generates the workers’ 
labels by 

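$$\max_{P} \; H(X \mid Y), \tag{1}$$

subject to the worker and item constraints (Figure 1)

$$\sum_{j} \left[ \phi_{ij}(c, k) - \widehat{\phi}_{ij}(c, k) \right] = 0, \quad \forall i, k, c, \tag{2a}$$

$$\sum_{i} \left[ \phi_{ij}(c, k) - \widehat{\phi}_{ij}(c, k) \right] = 0, \quad \forall j, k, c, \tag{2b}$$

plus the probability constraints

$$\sum_{k} P(X_{ij} = k \mid Y_j = c) = 1, \quad \forall i, j, c, \tag{3a}$$

$$\sum_{c} Q(Y_j = c) = 1, \quad \forall j, \tag{3b}$$

$$Q(Y_j = c) \geq 0, \quad \forall j, c. \tag{3c}$$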


The constraints in Equation (2a) enforce the expected confusion counts in the worker dimension
to match their empirical counterparts. Symmetrically, the constraints in Equation (2b) enforce
the expected confusion counts in the item dimension to match their empirical counterparts. An
illustration of empirical confusion tensors is shown in Figure 2.

When both the distributions P and Q are unknown, we propose to jointly estimate them by 


$$\min_{Q} \max_{P} \; H(X \mid Y), \tag{4}$$

subject to the constraints in Equations (2) and (3). Intuitively, entropy can be understood as a
measure of uncertainty. Thus, minimizing the maximum conditional entropy means that, given
the true labels, the workers' labels are the least random. Theoretically, minimizing the maximum
conditional entropy can be connected to maximum likelihood. In what follows, we show how the
connection is established.

2.3 Dual Form 

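The Lagrangian of the maximization problem in (4) can be written as

$$L = H(X \mid Y) + L_\sigma + L_\tau + L_\lambda \tag{5}$$

with

$$L_\sigma = \sum_{i,c,k} \sigma_i(c, k) \sum_{j} \left[ \phi_{ij}(c, k) - \widehat{\phi}_{ij}(c, k) \right],$$

$$L_\tau = \sum_{j,c,k} \tau_j(c, k) \sum_{i} \left[ \phi_{ij}(c, k) - \widehat{\phi}_{ij}(c, k) \right],$$

$$L_\lambda = \sum_{i,j,c} \lambda_{ijc} \left[ \sum_{k} P(X_{ij} = k \mid Y_j = c) - 1 \right],$$

where $\sigma_i(c, k)$, $\tau_j(c, k)$ and $\lambda_{ijc}$ are introduced as the Lagrange multipliers. By the
Karush-Kuhn-Tucker (KKT) conditions (Boyd and Vandenberghe, 2004),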


$$\frac{\partial L}{\partial P(X_{ij} = k \mid Y_j = c)} = 0,$$

which implies

$$\log P(X_{ij} = k \mid Y_j = c) = \lambda_{ijc} - 1 + \sigma_i(c, k) + \tau_j(c, k).$$

Combining the above equation and the probability constraints in (3a) eliminates $\lambda$ and yields

$$P(X_{ij} = k \mid Y_j = c) = \frac{1}{Z_{ij}} \exp\left[ \sigma_i(c, k) + \tau_j(c, k) \right], \tag{6}$$

where $Z_{ij}$ is the normalization factor given by

$$Z_{ij} = \sum_{k} \exp\left[ \sigma_i(c, k) + \tau_j(c, k) \right].$$
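The labeling model in Equation (6) is a softmax over the observed class $k$ for each worker, item, and true class. A minimal NumPy sketch follows; the function name `labeling_model`, the array layout `[i, j, c, k]`, and the random parameters are assumptions made for illustration.

```python
import numpy as np

def labeling_model(sigma, tau):
    """P[i, j, c, k] = exp(sigma_i(c,k) + tau_j(c,k)) / Z_ij, normalized over k.

    sigma: (workers, classes, classes) worker confusion parameters.
    tau:   (items, classes, classes) item confusion parameters.
    """
    logits = sigma[:, None, :, :] + tau[None, :, :, :]
    # subtract the row maximum before exponentiating for numerical stability
    unnorm = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return unnorm / unnorm.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
sigma = rng.normal(size=(4, 3, 3))   # 4 hypothetical workers, 3 classes
tau = rng.normal(size=(5, 3, 3))     # 5 hypothetical items
P = labeling_model(sigma, tau)

# With all item parameters set to zero, the model no longer depends on the item.
P_no_item = labeling_model(sigma, np.zeros_like(tau))
```

Each slice `P[i, j, c, :]` is a probability distribution over the label that worker `i` gives to item `j` when the true class is `c`.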


Although the matrices $[\sigma_i(c, k)]$ and $[\tau_j(c, k)]$ in Equation (6) come out as the mathematical
consequence of minimax conditional entropy, they can be understood intuitively. We can consider the
matrix $[\sigma_i(c, k)]$ as the measure of the intrinsic ability of worker $i$. The $(c, k)$-th entry measures
how likely worker $i$ labels a randomly chosen item in class $c$ as class $k$. Similarly, we can consider
the matrix $[\tau_j(c, k)]$ as the measure of the intrinsic difficulty of item $j$. The $(c, k)$-th entry measures
how likely item $j$ in class $c$ is labeled as class $k$ by a randomly chosen worker. In the following, we
refer to $[\sigma_i(c, k)]$ as worker confusion matrices and $[\tau_j(c, k)]$ as item confusion matrices.

Substituting the labeling model in Equation (6) into the Lagrangian in Equation (5), we can
obtain the dual form of the minimax problem (4) as (see Appendix A)

$$\max_{\sigma, \tau, Q} \; \sum_{j,c} Q(Y_j = c) \sum_{i} \log P(X_{ij} = x_{ij} \mid Y_j = c). \tag{7}$$

It is obvious that, to be optimal, the true label distribution has to be deterministic. Thus, the dual
Lagrangian can be equivalently expressed as the complete log-likelihood

$$\log \left\{ \prod_{j} \sum_{c} Q(Y_j = c) \prod_{i} P(X_{ij} = x_{ij} \mid Y_j = c) \right\}.$$

In Section 3, we show how to regularize the objective function in (7) to generate probabilistic labels.

2.4 Minimizing KL Divergence 

Let us extend the two distributions $P$ and $Q$ to the product space $X \times Y$. We extend the distribution
$Q$ by defining $Q(X_{ij} = x_{ij}) = 1$, and $Q(Y)$ stays the same. We extend the distribution $P$ with
$P(X, Y) = \prod_{i,j} P(X_{ij} \mid Y_j) P(Y_j)$, where $P(X_{ij} \mid Y_j)$ is given by Equation (6), and $P(Y)$ is a uniform
distribution over all possible classes. Then, we have

Theorem 2.1 When the true labels are deterministic, minimizing the KL divergence from $Q$ to $P$,
that is,

$$\min_{P, Q} \; D_{\mathrm{KL}}(Q \,\|\, P) = \sum_{X, Y} Q(X, Y) \log \frac{Q(X, Y)}{P(X, Y)}, \tag{8}$$

is equivalent to the minimax problem in (4).

The proof is presented in Appendix B. A sketch of the proof is as follows. We show that

$$D_{\mathrm{KL}}(Q \,\|\, P) = -\sum_{j,c} Q(Y_j = c) \sum_{i,k} Q(X_{ij} = k \mid Y_j = c) \log P(X_{ij} = k \mid Y_j = c) + \sum_{Y} Q(Y) \log Q(Y) - \log P(Y).$$

By the definition of $P(X, Y)$, $P(Y)$ is a constant. Moreover, when the true labels are deterministic,
we have

$$\sum_{Y} Q(Y) \log Q(Y) = 0.$$

This concludes the proof of this theorem.

3 Regularized Minimax Conditional Entropy 

In this section, we regularize our minimax conditional entropy method to address two practical 
issues: 

• Preventing overfitting. While crowdsourcing is cheap, collecting many redundant labels
may be more expensive than hiring experts. Typically, the number of labels collected for each
item is limited to a small number. In this case, the empirical counts in Equation (2) may
not match their expected values. It is likely that they fluctuate around their expected values
although these fluctuations are not large.

• Generating probabilistic labels. Our minimax conditional entropy method can only
generate deterministic labels (see Section 2.3). In practice, probabilistic labels are usually more
useful than deterministic labels. When the estimated label distribution for an item is close
to uniform over several classes, we can either ask for more labels for the item from the crowd
or forward the item to an external expert.

For addressing the issue of overfitting, we formulate our observation by replacing exact matching
with approximate matching while penalizing large fluctuations. For generating probabilistic labels,
we consider an entropy regularization over the unknown true label distribution. This is motivated
by the analysis in Section 2.4.

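Formally, we regularize our minimax conditional entropy method as follows. Let us denote the
entropy of the true label distribution by

$$H(Y) = -\sum_{j,c} Q(Y_j = c) \log Q(Y_j = c).$$

To estimate the true labels, we consider

$$\min_{Q} \max_{P} \; H(X \mid Y) - H(Y) - \frac{1}{\alpha}\, \Omega(\xi) - \frac{1}{\beta}\, \Psi(\zeta) \tag{9}$$

subject to the relaxed worker and item constraints

$$\sum_{j} \left[ \phi_{ij}(c, k) - \widehat{\phi}_{ij}(c, k) \right] = \xi_i(c, k), \quad \forall i, k, c, \tag{10a}$$

$$\sum_{i} \left[ \phi_{ij}(c, k) - \widehat{\phi}_{ij}(c, k) \right] = \zeta_j(c, k), \quad \forall j, k, c, \tag{10b}$$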


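plus the probability constraints in Equation (3). The regularization functions $\Omega$ and $\Psi$ are chosen
as

$$\Omega(\xi) = \frac{1}{2} \sum_{i} \sum_{c,k} \left[ \xi_i(c, k) \right]^2, \tag{11a}$$

$$\Psi(\zeta) = \frac{1}{2} \sum_{j} \sum_{c,k} \left[ \zeta_j(c, k) \right]^2. \tag{11b}$$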


The new slack variables $\xi_i(c, k)$ and $\zeta_j(c, k)$ in Equation (10) model the possible fluctuations. Note that
these slack variables are not restricted to be positive. When there are a sufficiently large number
of observations, the fluctuations should be approximately normally distributed, due to the central
limit theorem. This observation motivates the choice of the regularization functions in (11) to
penalize large fluctuations. The entropy term $H(Y)$ in the objective function, which is introduced
for generating probabilistic labels, can be regarded as penalizing a large deviation from the uniform
distribution.

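Substituting the labeling model from Equation (6) into the Lagrangian of (9), we obtain the
dual form (see Appendix C)

$$\max_{\sigma, \tau, Q} \; \sum_{j,c} Q(Y_j = c) \sum_{i} \log P(X_{ij} = x_{ij} \mid Y_j = c) + H(Y) - \alpha\, \Omega^*(\sigma) - \beta\, \Psi^*(\tau), \tag{12}$$

where

$$\Omega^*(\sigma) = \frac{1}{2} \sum_{i} \sum_{c,k} \left[ \sigma_i(c, k) \right]^2, \tag{13}$$

$$\Psi^*(\tau) = \frac{1}{2} \sum_{j} \sum_{c,k} \left[ \tau_j(c, k) \right]^2. \tag{14}$$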

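When $\alpha = 0$ and $\beta = 0$, the objective function in (12) turns out to be a lower bound of the log
marginal likelihood

$$\begin{aligned}
\log \left\{ \prod_{j} \sum_{c} \prod_{i} P(X_{ij} = x_{ij} \mid Y_j = c) \right\}
&= \log \left\{ \prod_{j} \sum_{c} \frac{Q(Y_j = c)}{Q(Y_j = c)} \prod_{i} P(X_{ij} = x_{ij} \mid Y_j = c) \right\} \\
&\geq \sum_{j,c} Q(Y_j = c) \sum_{i} \log P(X_{ij} = x_{ij} \mid Y_j = c) + H(Y).
\end{aligned}$$

The last step is based on Jensen's inequality. Maximizing the marginal likelihood is more
appropriate than maximizing the complete likelihood since only the observed data matters in our
inference.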
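The Jensen bound above can be checked numerically. The sketch below uses randomly generated stand-ins for $\log P(X_{ij} = x_{ij} \mid Y_j = c)$ (hypothetical values, not fitted to any data) and verifies that the bound holds for an arbitrary $Q$ and becomes tight when $Q(Y_j = c) \propto \prod_i P(X_{ij} = x_{ij} \mid Y_j = c)$.

```python
import numpy as np

rng = np.random.default_rng(1)
n_workers, n_items, n_classes = 4, 6, 3

# Hypothetical values of log P(X_ij = x_ij | Y_j = c), shape (workers, items, classes).
log_p_obs = np.log(rng.uniform(0.05, 1.0, size=(n_workers, n_items, n_classes)))
per_item = log_p_obs.sum(axis=0)              # sum_i log P(...), shape (items, classes)

# Log marginal likelihood: sum_j log sum_c prod_i P(x_ij | c).
log_marginal = np.log(np.exp(per_item).sum(axis=1)).sum()

# Lower bound for an arbitrary label distribution Q.
Q = rng.dirichlet(np.ones(n_classes), size=n_items)
entropy = -(Q * np.log(Q)).sum()
lower_bound = (Q * per_item).sum() + entropy

# The bound is tight when Q is proportional to prod_i P(x_ij | c).
Q_star = np.exp(per_item)
Q_star /= Q_star.sum(axis=1, keepdims=True)
tight_bound = (Q_star * per_item).sum() - (Q_star * np.log(Q_star)).sum()
```

The tight choice of `Q_star` has the same form as a posterior over the true label under a uniform prior.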

Finally, we introduce a variant of our regularized minimax conditional entropy. It is obtained
by restricting the feasible region of the slack variables through

$$\sum_{c} \xi_i(c, c) = 0, \quad \forall i.$$

This is equivalent to

$$\sum_{j,c} \left[ \phi_{ij}(c, c) - \widehat{\phi}_{ij}(c, c) \right] = 0, \quad \forall i. \tag{15}$$

It says that the empirical count of the correct answers from each worker is equal to its expectation.
According to the law of large numbers, this assumption is approximately correct when a worker has
a sufficiently large number of correct answers. Note that this does not mean that the percentage
of the correct answers from the worker has to be large. Let $K$ denote the number of classes. Under the
additional constraints in Equation (15), the dual problem can still be expressed by (12) except (see
Appendix C)




^E( <^i(c,c) - cri(c,c) [^CJi(c, fe) -CJi(c,/c) 


fc^c 


(16) 


where ^ ^ 

c^*(c,c) = — ^(Ti(c,c), ai{c,k) = 

c C k^c 

From our empirical evaluations, this variant is somewhat worse than its original version on most 
datasets. We include it here only for theoretic interest. 












4 Objective Measurement Principle 

In this section, we introduce a natural objective measurement principle, and show that the
probabilistic labeling model in Equation (6) is a consequence of this principle.

Intuitively, the objective measurement principle can be described as follows: 

1. A comparison of labeling difficulty between two items should be independent of which
particular workers were involved in the comparison; and it should also be independent of which
other items might also be compared.

2. Symmetrically, a comparison of labeling ability between two workers should be independent 
of which particular items were involved in the comparison; and it should also be independent 
of which other workers might also be compared. 

Next we mathematically define the objective measurement principle. 

Assume that worker $i$ has labeled items $j$ and $j'$ in class $c$. Denote by $E$ the event that one of
these two items is labeled as $k$, and the other is labeled as $c$. Formally,

$$E = \left\{ \mathbb{I}(X_{ij} = k) + \mathbb{I}(X_{ij'} = k) = 1, \; \mathbb{I}(X_{ij} = c) + \mathbb{I}(X_{ij'} = c) = 1 \right\}.$$

Denote by $A$ the event that item $j$ is labeled as $k$ and item $j'$ is labeled as $c$. Formally,

$$A = \left\{ X_{ij} = k, \; X_{ij'} = c \right\}.$$

It is obvious that $A \subset E$. Now we formulate requirement (1) in the objective measurement
principle as follows: $P(A \mid E)$ is independent of worker $i$. Note that

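$$P(A \mid E) = \frac{P(X_{ij} = k \mid Y_j = c)\, P(X_{ij'} = c \mid Y_{j'} = c)}{P(X_{ij} = k \mid Y_j = c)\, P(X_{ij'} = c \mid Y_{j'} = c) + P(X_{ij} = c \mid Y_j = c)\, P(X_{ij'} = k \mid Y_{j'} = c)}.$$

Hence, $P(A \mid E)$ is independent of worker $i$ if and only if

$$\frac{P(X_{ij} = k \mid Y_j = c)\, P(X_{ij'} = c \mid Y_{j'} = c)}{P(X_{ij} = c \mid Y_j = c)\, P(X_{ij'} = k \mid Y_{j'} = c)}$$

is independent of worker $i$. In other words, given another arbitrary worker $i'$, we should have

$$\frac{P(X_{ij} = k \mid Y_j = c)\, P(X_{ij'} = c \mid Y_{j'} = c)}{P(X_{ij} = c \mid Y_j = c)\, P(X_{ij'} = k \mid Y_{j'} = c)} = \frac{P(X_{i'j} = k \mid Y_j = c)\, P(X_{i'j'} = c \mid Y_{j'} = c)}{P(X_{i'j} = c \mid Y_j = c)\, P(X_{i'j'} = k \mid Y_{j'} = c)}.$$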

Without loss of generality, we choose $i' = 0$, $j' = 0$ as the fixed references. Then,

$$\frac{P(X_{ij} = k \mid Y_j = c)}{P(X_{ij} = c \mid Y_j = c)} \propto \frac{P(X_{i0} = k \mid Y_0 = c)}{P(X_{i0} = c \mid Y_0 = c)} \cdot \frac{P(X_{0j} = k \mid Y_j = c)}{P(X_{0j} = c \mid Y_j = c)}.$$

By the fact that probabilities are nonnegative, we can write

$$P(X_{i0} = k \mid Y_0 = c) = \exp\left[ \sigma_i(c, k) \right], \quad P(X_{0j} = k \mid Y_j = c) = \exp\left[ \tau_j(c, k) \right].$$

The probabilistic labeling model in Equation (6) follows immediately. It is easy to verify that, due
to the symmetry between item difficulty and worker ability, we can instead start from formulating
requirement (2) in the objective measurement principle to achieve the same result. Hence, in
this sense, the two requirements are actually redundant.
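The worker-independence of $P(A \mid E)$ under the labeling model of Equation (6) can be checked numerically. The sketch below draws random parameters (purely illustrative) and evaluates $P(A \mid E)$ for one pair of items across all workers; the helper names `labeling_model` and `prob_a_given_e` are assumptions.

```python
import numpy as np

def labeling_model(sigma, tau):
    """P[i, j, c, k] proportional to exp(sigma_i(c,k) + tau_j(c,k)), normalized over k."""
    logits = sigma[:, None, :, :] + tau[None, :, :, :]
    unnorm = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return unnorm / unnorm.sum(axis=-1, keepdims=True)

def prob_a_given_e(P, i, j, j2, c, k):
    """P(A | E) for worker i and items j, j2 that both belong to class c."""
    a = P[i, j, c, k] * P[i, j2, c, c]   # item j labeled k, item j2 labeled c
    b = P[i, j, c, c] * P[i, j2, c, k]   # item j labeled c, item j2 labeled k
    return a / (a + b)

rng = np.random.default_rng(2)
n_workers = 5
P = labeling_model(rng.normal(size=(n_workers, 3, 3)), rng.normal(size=(4, 3, 3)))

# The worker parameters cancel in the ratio, so every worker gives the same value.
values = np.array([prob_a_given_e(P, i, j=0, j2=1, c=2, k=0) for i in range(n_workers)])
```

All entries of `values` coincide because $\sigma_i$ and the normalization factors cancel, leaving a quantity that depends only on $\tau_j$ and $\tau_{j'}$.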
5 Extension to Ordinal Labels


In this section, we extend the minimax conditional entropy principle from multiclass to ordinal 
labels. Eliciting ordinal labels is important in tasks such as judging the relative quality of web 
search results or consumer products. Since ordinal labels are a special case of multiclass labels, the 
approach that we have developed in the previous sections can be used to aggregate ordinal labels. 
However, we observe that, in ordinal labeling, workers usually have an error pattern different 
from what we observe in multiclass labeling. We summarize our observation as the adjacency 
confusability assumption, and formulate it by introducing a different set of constraints for workers 
and items. 


5.1 Adjacency Confusability 

In ordinal labeling, workers usually have difficulty distinguishing between two adjacent ordinal 
classes whereas distinguishing between two classes which are far away from each other is much 
easier. We refer to this observation as adjacency confusability. 

To illustrate this observation, let us consider the example of screening mammograms. A
mammogram is an x-ray picture used to check for breast cancer in women. Radiologists often rate
mammograms on a scale such as no cancer, benign cancer, possible malignancy, or malignancy.
In screening mammograms, a radiologist may rate a mammogram which indicates possible
malignancy as malignancy, but it is less likely that she rates a mammogram which indicates no cancer
as malignancy.


5.2 Ordinal Minimax Conditional Entropy


In what follows, we construct a different set of worker and item constraints to encode adjacency 
confusability. The formulation leads to an ordinal labeling model parameterized with structured 
confusion matrices for workers and items. 

We introduce two symbols $\Delta$ and $\nabla$ which take on arbitrary binary relations in $\{\geq, <\}$. Ordinal
labels are represented by consecutive integers, and the minimum one is 0. To estimate the true ordinal
labels, we consider

$$\min_{Q} \max_{P} \; H(X \mid Y) \tag{17}$$

subject to the ordinal-based worker and item constraints

$$\sum_{c \Delta s} \sum_{k \nabla s} \sum_{j} \left[ \phi_{ij}(c, k) - \widehat{\phi}_{ij}(c, k) \right] = 0, \quad \forall i, s \geq 1, \tag{18a}$$

$$\sum_{c \Delta s} \sum_{k \nabla s} \sum_{i} \left[ \phi_{ij}(c, k) - \widehat{\phi}_{ij}(c, k) \right] = 0, \quad \forall j, s \geq 1, \tag{18b}$$

for all $\Delta, \nabla \in \{\geq, <\}$, and the probability constraints in (3). We exclude the case $s = 0$ in which
the constraints trivially hold.

Let us explain the meaning of the constraints in Equation (18). To construct ordinal-based
constraints, the first issue that we have to address is how to compare the observed label $x_{ij}$ and
the true label $Y_j$ in an ordinal sense. For multiclass labels, as we have seen in Section 2, the label
comparison problem is trivial: we only need to check whether they are equal or not. For ordinal
labels, such a problem becomes tricky. Here, we propose an indirect comparison between two
ordinal labels by comparing both to a reference label $s$ which varies through all possible values in
a given ordinal label set (Figure 3). Consequently, for every chosen reference label $s$, we partition
the Cartesian product of the label set into four disjoint regions

$$\{(c, k) \mid c < s, k < s\}, \quad \{(c, k) \mid c < s, k \geq s\},$$

$$\{(c, k) \mid c \geq s, k < s\}, \quad \{(c, k) \mid c \geq s, k \geq s\}.$$

A partition example is shown in Table 1 where the given label set is {0,1,2,3}. Then, Equation
(18a) defines a set of constraints for the workers by summing Equation (2a) over each region.
Similarly, Equation (18b) defines a set of constraints for the items by summing Equation (2b) over
each region.

Figure 3: Indirect comparison between a true label and an observed label via comparing both to a
reference label which varies through all possible values in a given ordinal label set.
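The partition into four regions for a given reference label can be enumerated directly. The following is a small sketch; the function name `partition` and the string keys `"<"` and `">="` are illustrative choices.

```python
def partition(label_set, s):
    """Split all (c, k) pairs into four regions by comparing c and k to reference s."""
    regions = {("<", "<"): [], ("<", ">="): [], (">=", "<"): [], (">=", ">="): []}
    for c in label_set:
        for k in label_set:
            key = ("<" if c < s else ">=", "<" if k < s else ">=")
            regions[key].append((c, k))
    return regions

label_set = [0, 1, 2, 3]
regions_s1 = partition(label_set, 1)
regions_s2 = partition(label_set, 2)
```

For the label set {0,1,2,3} and `s = 2`, each region contains four pairs; for `s = 1`, the region with both labels below the reference contains only `(0, 0)`.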

From the discussion above, we can see that when there are more than two ordinal classes, the
constraints in Equation (18) are less restrictive than those in Equation (2). Consequently, as we
see below, the labeling model resulting from Equation (18) has fewer parameters. In the case in
which there are only two ordinal classes, the sets of disjoint regions degenerate to pairs $(c, k)$ and,
thus, the sets of constraints in Equations (18) and (2) are identical.

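Next we explain why we construct the ordinal-based constraints in such a way. Let us write

$$\begin{aligned}
\sum_{c \Delta s} \sum_{k \nabla s} \sum_{j} \widehat{\phi}_{ij}(c, k)
&= \sum_{j} \sum_{c \Delta s} \sum_{k \nabla s} Q(Y_j = c)\, \mathbb{I}(x_{ij} = k) \\
&= \sum_{j} \left[ \sum_{c \Delta s} Q(Y_j = c) \right] \left[ \sum_{k \nabla s} \mathbb{I}(x_{ij} = k) \right] \\
&= \sum_{j} Q(Y_j \,\Delta\, s)\, \mathbb{I}(x_{ij} \,\nabla\, s).
\end{aligned}$$

For example, when $\Delta$ is $<$ and $\nabla$ is $\geq$, the above equation becomes

$$\sum_{j} \sum_{c < s} \sum_{k \geq s} \widehat{\phi}_{ij}(c, k) = \sum_{j} Q(Y_j < s)\, \mathbb{I}(x_{ij} \geq s).$$

This counts the items each of which belongs to a class less than $s$ but to which worker $i$ assigned a label
larger than or equal to $s$.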
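The identity above, for the case where the true class is below the reference and the observed label is at or above it, can be verified numerically for a single worker. The labels and label distribution below are random placeholders chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n_items, n_classes, s = 8, 4, 2

x_i = rng.integers(0, n_classes, size=n_items)        # one worker's observed labels
Q = rng.dirichlet(np.ones(n_classes), size=n_items)   # Q[j, c] = Q(Y_j = c)

# phi_hat[j, c, k] = Q(Y_j = c) * I(x_ij = k) for this worker
phi_hat = Q[:, :, None] * np.eye(n_classes)[x_i][:, None, :]

# Left side: sum the empirical confusion tensor over the region c < s, k >= s.
lhs = phi_hat[:, :s, s:].sum()
# Right side: sum_j Q(Y_j < s) * I(x_ij >= s).
rhs = (Q[:, :s].sum(axis=1) * (x_i >= s)).sum()
```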

In general, for a comparison between an observed label and a reference label, there are two
possible outcomes: the observed label is larger than or equal to the reference label; or the observed label
is smaller than the reference label. These are also the two possible outcomes for a comparison
between a true label and a reference label. Putting these together, we have four possible outcomes
in total. The constraints in Equation (18a) enforce expected counts of all four kinds of outcomes
in the worker dimension to match their empirical counterparts. Symmetrically, the constraints in
Equation (18b) enforce expected counts of all four kinds of outcomes in the item dimension to
match their empirical counterparts.

(a) Partitioning with s = 1

(0,0) | (0,1) (0,2) (0,3)
------+------------------
(1,0) | (1,1) (1,2) (1,3)
(2,0) | (2,1) (2,2) (2,3)
(3,0) | (3,1) (3,2) (3,3)

(b) Partitioning with s = 2

(0,0) (0,1) | (0,2) (0,3)
(1,0) (1,1) | (1,2) (1,3)
------------+------------
(2,0) (2,1) | (2,2) (2,3)
(3,0) (3,1) | (3,2) (3,3)

(c) Partitioning with s = 3

(0,0) (0,1) (0,2) | (0,3)
(1,0) (1,1) (1,2) | (1,3)
(2,0) (2,1) (2,2) | (2,3)
------------------+------
(3,0) (3,1) (3,2) | (3,3)

Table 1: Partitioning the Cartesian product of the ordinal label set {0,1,2,3}. With respect to
each possible reference label, each table is partitioned into four disjoint regions.

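The Lagrangian of the maximization problem in (17) can be written as

$$L = H(X \mid Y) + L_\sigma + L_\tau + L_\lambda,$$

with

$$L_\sigma = \sum_{i,s} \sum_{\Delta, \nabla} \sigma_{is}^{\Delta,\nabla} \sum_{c \Delta s} \sum_{k \nabla s} \sum_{j} \left[ \phi_{ij}(c, k) - \widehat{\phi}_{ij}(c, k) \right],$$

$$L_\tau = \sum_{j,s} \sum_{\Delta, \nabla} \tau_{js}^{\Delta,\nabla} \sum_{c \Delta s} \sum_{k \nabla s} \sum_{i} \left[ \phi_{ij}(c, k) - \widehat{\phi}_{ij}(c, k) \right],$$

$$L_\lambda = \sum_{i,j,c} \lambda_{ijc} \left[ \sum_{k} P(X_{ij} = k \mid Y_j = c) - 1 \right],$$

where $\sigma_{is}^{\Delta,\nabla}$, $\tau_{js}^{\Delta,\nabla}$ and $\lambda_{ijc}$ are the introduced Lagrange multipliers. By a procedure similar to that
in Section 2, we obtain a probabilistic ordinal labeling model

$$P(X_{ij} = k \mid Y_j = c) = \frac{1}{Z_{ij}} \exp\left[ \sigma_i(c, k) + \tau_j(c, k) \right], \tag{19}$$

where

$$\sigma_i(c, k) = \sum_{s \geq 1} \sum_{\Delta, \nabla} \sigma_{is}^{\Delta,\nabla}\, \mathbb{I}(c \,\Delta\, s, \; k \,\nabla\, s), \tag{20a}$$

$$\tau_j(c, k) = \sum_{s \geq 1} \sum_{\Delta, \nabla} \tau_{js}^{\Delta,\nabla}\, \mathbb{I}(c \,\Delta\, s, \; k \,\nabla\, s). \tag{20b}$$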
The ordinal labeling model in Equation (19) is actually the same as the multiclass labeling model
in Equation (6) except that the worker and item confusion matrices in Equation (19) are now subtly
structured through Equation (20). It is because of the structure that the ordinal labeling model
has fewer parameters than the multiclass labeling model when there are more than two classes. In
the case in which there are only two classes, the ordinal labeling model and the multiclass labeling
model coincide as one would expect.
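To make the structure concrete, the sketch below assembles a structured confusion matrix from one parameter per reference label and region, following the form of Equation (20). The storage layout `theta[s - 1, a, b]`, with `a = int(c >= s)` and `b = int(k >= s)`, and the function name `structured_matrix` are assumptions made for illustration.

```python
import numpy as np

def structured_matrix(theta):
    """Build a K x K matrix M[c, k] by summing, over s = 1..K-1, the parameter of
    the region that (c, k) falls into relative to the reference label s.

    theta: (K - 1, 2, 2) array; theta[s - 1, int(c >= s), int(k >= s)].
    """
    K = theta.shape[0] + 1
    M = np.zeros((K, K))
    for s in range(1, K):
        for c in range(K):
            for k in range(K):
                M[c, k] += theta[s - 1, int(c >= s), int(k >= s)]
    return M

K = 4
theta = np.zeros((K - 1, 2, 2))
theta[0, 0, 1] = 1.0          # s = 1, true class below s, observed label at or above s
M = structured_matrix(theta)

n_ordinal_params = theta.size  # 4 * (K - 1)
n_multiclass_params = K * K
```

With four classes the structured matrix uses 12 parameters against 16 for an unstructured one, and with two classes both counts equal 4, in line with the remark that the two models coincide in the binary case.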

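The regularized minimax conditional entropy for ordinal labels can be written as

$$\min_{Q} \max_{P} \; H(X \mid Y) - H(Y) - \frac{1}{\alpha}\, \Omega(\xi) - \frac{1}{\beta}\, \Psi(\zeta) \tag{21}$$

subject to the relaxed worker and item constraints

$$\sum_{c \Delta s} \sum_{k \nabla s} \sum_{j} \left[ \phi_{ij}(c, k) - \widehat{\phi}_{ij}(c, k) \right] = \xi_{is}^{\Delta,\nabla}, \quad \forall i, s, \tag{22a}$$

$$\sum_{c \Delta s} \sum_{k \nabla s} \sum_{i} \left[ \phi_{ij}(c, k) - \widehat{\phi}_{ij}(c, k) \right] = \zeta_{js}^{\Delta,\nabla}, \quad \forall j, s, \tag{22b}$$

for all $\Delta, \nabla \in \{\geq, <\}$, and the probability constraints in Equation (3). When we choose

$$\Omega(\xi) = \frac{1}{2} \sum_{i,s} \sum_{\Delta, \nabla} \left( \xi_{is}^{\Delta,\nabla} \right)^2, \quad \Psi(\zeta) = \frac{1}{2} \sum_{j,s} \sum_{\Delta, \nabla} \left( \zeta_{js}^{\Delta,\nabla} \right)^2,$$

the dual problem becomes

$$\max_{\sigma, \tau, Q} \; \sum_{j,c} Q(Y_j = c) \sum_{i} \log P(X_{ij} = x_{ij} \mid Y_j = c) + H(Y) - \alpha\, \Omega^*(\sigma) - \beta\, \Psi^*(\tau),$$

where

$$\Omega^*(\sigma) = \frac{1}{2} \sum_{i,s} \sum_{\Delta, \nabla} \left( \sigma_{is}^{\Delta,\nabla} \right)^2, \quad \Psi^*(\tau) = \frac{1}{2} \sum_{j,s} \sum_{\Delta, \nabla} \left( \tau_{js}^{\Delta,\nabla} \right)^2.$$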
5.3 Ordinal Objective Measurement Principle

In this section, we adapt the objective measurement principle developed in Section 4 to ordinal
labels.

Assume that worker $i$ has labeled items $j$ and $j'$ in class $c$. For any class $k$, we define two events.
The first event is

$$E = \left\{ \mathbb{I}(X_{ij} = k) + \mathbb{I}(X_{ij'} = k) = 1, \; \mathbb{I}(X_{ij} = k + 1) + \mathbb{I}(X_{ij'} = k + 1) = 1 \right\},$$

and the other event is

$$A = \left\{ X_{ij} = k, \; X_{ij'} = k + 1 \right\}.$$

Note that $A \subset E$. Now we formulate the objective measurement principle as follows: $P(A \mid E)$ is
independent of worker $i$. Assume that the labels of the items are independent. Then, $P(A \mid E)$ can
be written as

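$$\frac{P(X_{ij} = k \mid Y_j = c)\, P(X_{ij'} = k + 1 \mid Y_{j'} = c)}{P(X_{ij} = k \mid Y_j = c)\, P(X_{ij'} = k + 1 \mid Y_{j'} = c) + P(X_{ij} = k + 1 \mid Y_j = c)\, P(X_{ij'} = k \mid Y_{j'} = c)}.$$

Hence, $P(A \mid E)$ is independent of worker $i$ if and only if

$$\frac{P(X_{ij} = k \mid Y_j = c)\, P(X_{ij'} = k + 1 \mid Y_{j'} = c)}{P(X_{ij} = k + 1 \mid Y_j = c)\, P(X_{ij'} = k \mid Y_{j'} = c)}$$

is independent of worker $i$. In other words, given another arbitrary worker $i'$, we should have

$$\frac{P(X_{ij} = k \mid Y_j = c)\, P(X_{ij'} = k + 1 \mid Y_{j'} = c)}{P(X_{ij} = k + 1 \mid Y_j = c)\, P(X_{ij'} = k \mid Y_{j'} = c)} = \frac{P(X_{i'j} = k \mid Y_j = c)\, P(X_{i'j'} = k + 1 \mid Y_{j'} = c)}{P(X_{i'j} = k + 1 \mid Y_j = c)\, P(X_{i'j'} = k \mid Y_{j'} = c)}.$$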

To introduce adjacency confusability, we further assume that, for any two classes $c, c' \geq k + 1$ (or
$c, c' < k + 1$),

$$\frac{P(X_{ij} = k \mid Y_j = c)\, P(X_{ij'} = k + 1 \mid Y_{j'} = c)}{P(X_{ij} = k + 1 \mid Y_j = c)\, P(X_{ij'} = k \mid Y_{j'} = c)} = \frac{P(X_{ij} = k \mid Y_j = c')\, P(X_{ij'} = k + 1 \mid Y_{j'} = c')}{P(X_{ij} = k + 1 \mid Y_j = c')\, P(X_{ij'} = k \mid Y_{j'} = c')}.$$

Then, by a procedure similar to that in Section 4, we reach the probabilistic ordinal labeling model
described by Equations (19) and (20).


6 Implementation 

In this section, we present a simple yet efficient coordinate ascent method to solve the minimax
program through its dual form and also a practical procedure for model selection.


6.1 Coordinate Ascent 


The dual problem of regularized minimax conditional entropy for either multiclass or ordinal labels
is nonconvex. A stationary point can be obtained via coordinate ascent (Algorithm 1), which is
essentially Expectation-Maximization (EM) (Dempster et al., 1977; Neal and Hinton, 1998). We
first initialize the label estimate via aggregating votes in Equation (23). Then, in each iteration
step, given the current estimate of the labels, we update the estimate of the confusion matrices of the
workers and items by solving the optimization problem in (24a); and, given the current estimate of
the confusion matrices of workers and items, we update the estimate of the labels through the
closed-form formula in (24b), which is identical to applying Bayes' rule with a uniform prior. The
optimization problem in (24a) is strongly convex and smooth. Many algorithms can be applied here
(Nesterov, 2004). In our experiments, we simply use gradient ascent.

Algorithm 1 Regularized Minimax Conditional Entropy for Crowdsourcing

input: $\{x_{ij}\}$, $\alpha$, $\beta$

initialize:

$$Q(Y_j = c) \propto \sum_{i} \mathbb{I}(x_{ij} = c) \tag{23}$$

repeat:

$$\{\sigma, \tau\} = \arg\max_{\sigma, \tau} \; \sum_{j,c} Q(Y_j = c) \sum_{i} \log P(X_{ij} = x_{ij} \mid Y_j = c) - \alpha\, \Omega^*(\sigma) - \beta\, \Psi^*(\tau) \tag{24a}$$

$$Q(Y_j = c) \propto \prod_{i} P(X_{ij} = x_{ij} \mid Y_j = c) \tag{24b}$$

output: $Q$

Denote by $F$ the objective function in (24a). For multiclass labels, the gradients are computed as


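$$\frac{\partial F}{\partial \sigma_i(c, k)} = \sum_{j} Q(Y_j = c) \left[ \mathbb{I}(x_{ij} = k) - P(X_{ij} = k \mid Y_j = c) \right] - \alpha\, \sigma_i(c, k),$$

$$\frac{\partial F}{\partial \tau_j(c, k)} = \sum_{i} Q(Y_j = c) \left[ \mathbb{I}(x_{ij} = k) - P(X_{ij} = k \mid Y_j = c) \right] - \beta\, \tau_j(c, k).$$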

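For ordinal labels, the gradients are computed as

$$\frac{\partial F}{\partial \sigma_{is}^{\Delta,\nabla}} = \sum_{c \Delta s} \sum_{k \nabla s} \sum_{j} Q(Y_j = c) \left[ \mathbb{I}(x_{ij} = k) - P(X_{ij} = k \mid Y_j = c) \right] - \alpha\, \sigma_{is}^{\Delta,\nabla},$$

$$\frac{\partial F}{\partial \tau_{js}^{\Delta,\nabla}} = \sum_{c \Delta s} \sum_{k \nabla s} \sum_{i} Q(Y_j = c) \left[ \mathbb{I}(x_{ij} = k) - P(X_{ij} = k \mid Y_j = c) \right] - \beta\, \tau_{js}^{\Delta,\nabla}.$$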

It is worth pointing out that it is unnecessary to obtain the exact optimum at this intermediate 
step. We have observed that in practice, several gradient ascent steps here suffice for reaching a 
final good solution. 
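The coordinate ascent procedure for multiclass labels can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: it assumes every worker labels every item, and the step size `lr`, the iteration counts, and the function names are arbitrary choices made here.

```python
import numpy as np

def softmax_model(sigma, tau):
    """P[i, j, c, k] proportional to exp(sigma_i(c,k) + tau_j(c,k)), normalized over k."""
    logits = sigma[:, None] + tau[None, :]
    logits = logits - logits.max(axis=-1, keepdims=True)
    P = np.exp(logits)
    return P / P.sum(axis=-1, keepdims=True)

def minimax_entropy(labels, n_classes, alpha=1.0, beta=1.0, outer=20, inner=5, lr=0.1):
    """labels: (workers, items) integers in [0, n_classes); fully observed (assumption)."""
    n_workers, n_items = labels.shape
    one_hot = np.eye(n_classes)[labels]              # I(x_ij = k), shape (W, N, C)
    Q = one_hot.sum(axis=0)                          # vote counts per item
    Q = Q / Q.sum(axis=1, keepdims=True)             # initialization by votes
    sigma = np.zeros((n_workers, n_classes, n_classes))
    tau = np.zeros((n_items, n_classes, n_classes))
    for _ in range(outer):
        for _ in range(inner):                       # a few gradient ascent steps
            P = softmax_model(sigma, tau)
            # resid[i, j, c, k] = Q(Y_j = c) * (I(x_ij = k) - P(X_ij = k | Y_j = c))
            resid = Q[None, :, :, None] * (one_hot[:, :, None, :] - P)
            grad_sigma = resid.sum(axis=1) - alpha * sigma
            grad_tau = resid.sum(axis=0) - beta * tau
            sigma = sigma + lr * grad_sigma
            tau = tau + lr * grad_tau
        P = softmax_model(sigma, tau)
        # log P(X_ij = x_ij | Y_j = c) summed over workers, shape (N, C)
        log_obs = np.log(np.einsum("ijck,ijk->ijc", P, one_hot)).sum(axis=0)
        Q = np.exp(log_obs - log_obs.max(axis=1, keepdims=True))
        Q = Q / Q.sum(axis=1, keepdims=True)         # closed-form label update
    return Q, sigma, tau

# Hypothetical toy data: four workers who always answer correctly and one who always says 0.
truth = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
labels = np.vstack([np.tile(truth, (4, 1)), np.zeros((1, 9), dtype=int)])
Q, sigma, tau = minimax_entropy(labels, n_classes=3)
```

Only a handful of inner gradient steps are taken per outer iteration, reflecting the remark that the intermediate problem need not be solved exactly.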


6.2 Model Selection 

The regularization parameters $\alpha$ and $\beta$ can be chosen as follows. If the true labels of a subset
of items are known (such subsets are usually referred to as validation sets), we may choose the
regularization parameters such that those known true labels can be best predicted. Otherwise,
we suggest choosing the regularization parameters via $k$-fold likelihood-based cross-validation.
Specifically, we first randomly partition the crowd labels into $k$ equal-size subsets, and define a
finite set of possible choices for the regularization parameters. Then, for each possible choice of the
regularization parameters:

1. Leave out one subset and use the remaining $k - 1$ subsets to estimate the confusion matrices
of the workers and items;

2. Plug the estimate into the probabilistic labeling model to compute the likelihood of the left- 
out subset; 

3. Repeat the above two steps till each subset is left out once and only once; 

4. Average the likelihoods that we have computed. 

After going through all the possible choices for the regularization parameters, we choose the one 
which results in the largest average likelihood to run our algorithm over the full dataset. The 
cross-validation parameter k is typically set to 5 or 10. 

To simplify the model selection process, we suggest choosing

$$\alpha = \gamma \times (\text{number of classes})^2, \qquad \beta = \frac{\text{number of labels per worker}}{\text{number of labels per item}} \times \alpha. \tag{25}$$

In our experiments, we select $\gamma$ from $\{2^{-2}, 2^{-1}, 2^{0}, 2^{1}, 2^{2}\}$. In our limited empirical studies, larger
candidate sets for $\gamma$ did not give more gains. Two empirical observations motivate us to use
the square of the number of classes in Equation (25). First, the square of the number of
classes has the same magnitude as the number of parameters in a confusion matrix. Second, the
label noise dramatically increases when the number of classes increases, requiring a superlinearly
scaled regularization.
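
As a small worked example, the heuristic can be evaluated numerically. The sketch reads Equation (25) as $\alpha = \gamma \times (\text{number of classes})^2$ and $\beta = \alpha \times (\text{labels per worker}) / (\text{labels per item})$, and it takes "labels per worker" and "labels per item" to be dataset averages; both readings are assumptions.

```python
def regularization_parameters(gamma, num_classes, labels_per_worker, labels_per_item):
    """Heuristic (alpha, beta) from a single scale factor gamma."""
    alpha = gamma * num_classes ** 2
    beta = alpha * labels_per_worker / labels_per_item
    return alpha, beta


# Illustrative values: 5 classes, 15567 labels, 177 workers, 2665 items.
alpha, beta = regularization_parameters(
    gamma=1.0,
    num_classes=5,
    labels_per_worker=15567 / 177,
    labels_per_item=15567 / 2665,
)
```

With this reading, only $\gamma$ remains to be selected by cross-validation.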


7 Related Work 


In this section, we review existing work that is closely related to ours.

Dawid-Skene Model. Let $K$ denote the number of classes. Dawid and Skene (1979) propose
a generative model in which the ability of worker $i$ is characterized by a $K \times K$ probabilistic
confusion matrix $[p_i(c,k)]$ in which the diagonal element $p_i(c,c)$ represents the probability that
worker $i$ correctly labels an arbitrary item in class $c$, and the off-diagonal element $p_i(c,k)$ represents
the probability that worker $i$ mislabels an arbitrary item in class $c$ as class $k$. Our probabilistic
labeling model in Equation (6) reduces to the Dawid-Skene model when the item difficulty terms
$\tau_j(c,k)$ in our model disappear, since we can then reparameterize

$$p_i(c,k) = \frac{\exp[\sigma_i(c,k)]}{\sum_{k'} \exp[\sigma_i(c,k')]}.$$


In this sense, our model generalizes the Dawid-Skene model to incorporate item difficulty. To 
jointly estimate the workers’ abilities and the true labels in the Dawid-Skene model, in general, the 
marginal likelihood is maximized using the EM algorithm. 
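
The reparameterization is a row-wise softmax of the worker parameters. The short sketch below shows how a matrix of $\sigma_i(c,k)$ values maps to a probabilistic confusion matrix; the function name and the example values are illustrative assumptions.

```python
import math


def confusion_matrix_from_sigma(sigma_i):
    """Row-wise softmax: p_i(c, k) = exp(sigma_i(c, k)) / sum_k' exp(sigma_i(c, k'))."""
    matrix = []
    for row in sigma_i:
        m = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - m) for s in row]
        z = sum(exps)
        matrix.append([e / z for e in exps])
    return matrix


# A worker whose parameters encode 90% accuracy on class 0 and 80% on class 1.
sigma_i = [[math.log(9.0), 0.0], [0.0, math.log(4.0)]]
p_i = confusion_matrix_from_sigma(sigma_i)
```

Each row of the result sums to one, and adding a constant to a row of $\sigma_i$ leaves the confusion matrix unchanged.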

For binary labeling tasks, the probabilistic confusion matrix in the Dawid-Skene model can be
written as

$$\begin{pmatrix} p_i & 1 - p_i \\ 1 - q_i & q_i \end{pmatrix},$$

where $p_i$ is the accuracy of worker $i$ in the first class, and $q_i$ the accuracy in the second class.
This special case is usually referred to as the two-coin model (Raykar et al., 2010; Liu et al., 2012;
Chen et al., 2013). One may simplify the two-coin model by assuming $p_i = q_i$ (Ghosh et al., 2011;
Karger et al., 2014; Dalvi et al., 2013). This simplification is accordingly referred to as the
one-coin model.

Karger et al. (2014) propose an inference algorithm under the one-coin model, and show that
their algorithm achieves the minimax rate when the accuracy of every worker is bounded away
from 0 and 1, that is, with some fixed number $\epsilon > 0$, $\epsilon < p_i < 1 - \epsilon$. Liu et al. (2012) show that
the algorithm proposed by Karger et al. (2014) is essentially a belief propagation update with the
Haldane prior, which assumes that each worker is either a hammer ($p_i = 1$) or an adversary ($p_i = 0$)
with equal probability.

Gao and Zhou (2013) show that under the one-coin model, the global optimum of maximum
likelihood achieves the minimax rate. A projected EM algorithm is suggested and shown to achieve
nearly the same rate as that of the global optimum. Zhang et al. (2014) show that the EM algorithm
for the general Dawid-Skene model can achieve the minimax rate up to a logarithmic factor when
it is initialized by spectral methods (Anandkumar et al., 2012) and the accuracy of every worker is
bounded away from 0 and 1.

Raykar et al. (2010) extend the Dawid-Skene model by imposing a beta prior over the worker
confusion matrices. Moreover, they jointly learn the classifier and the true labels by assuming that
the true labels are generated by a logistic model. Liu et al. (2012) develop full Bayesian inference
via variational methods including belief propagation and mean field.


Rasch model (Rasch, 1961, 1968). In educational tests, the Rasch model describes the
response of each examinee of a given ability to each item in a test. In the model, the probability of
a correct response is modeled as a logistic function of the difference between the person and item
parameters, which are locations on a continuous latent trait. Person parameters represent the ability
of examinees while item parameters represent the difficulty of items.

Let $X_{ij} \in \{0, 1\}$ be a dichotomous random variable where $X_{ij} = 1$ denotes a correct response
and $X_{ij} = 0$ an incorrect response to a given assessment item. Mathematically, the Rasch model is
given by

$$P(X_{ij} = 1) = \frac{\exp(\beta_i - \delta_j)}{1 + \exp(\beta_i - \delta_j)},$$

where $\beta_i$ is the ability of examinee $i$ and $\delta_j$ the difficulty of item $j$. The larger an examinee's ability
relative to the difficulty of an item, the larger the probability of a correct response on that item.
When the examinee's ability on the latent trait is equal to the difficulty of the item, the probability
of a correct response is 1/2.
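
A short numerical sketch of the Rasch model follows. The function names are illustrative. The log-odds helper is included to show a standard property of the model: the difference in log-odds between two examinees equals the difference in their abilities, whatever item is used for the comparison.

```python
import math


def rasch_probability(ability, difficulty):
    """P(X_ij = 1) = exp(b - d) / (1 + exp(b - d)), written in a stable form."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def log_odds(p):
    """Log-odds (logit) of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


# Equal ability and difficulty gives probability 1/2.
p_equal = rasch_probability(1.3, 1.3)

# Comparing two examinees on items of different difficulty gives the same gap.
gaps = [
    log_odds(rasch_probability(2.0, d)) - log_odds(rasch_probability(0.5, d))
    for d in (-1.0, 0.0, 2.5)
]
```

In this example every entry of `gaps` equals the ability difference 2.0 - 0.5 = 1.5 up to rounding.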

The Rasch model is a special item response theory (IRT) model (Lord and Novick, 1968). However,
unlike other IRT models, the Rasch model satisfies the objective measurement principle
pioneered by Rasch. Our work generalizes both the Rasch model and the objective measurement
principle to multiclass labeling tasks. In addition, unlike the Rasch model, in our scenario, the true
answers are unknown and have to be estimated.

Polytomous Rasch model. The Rasch model has been adapted to applications in which
responses to items are scored with successive integers, such as rating scales. Let $X_{ij} \in \{0, 1, \ldots, m\}$.
Andrich (1978) suggests

$$P(X_{ij} = k) = \frac{\exp \sum_{s=0}^{k} [\beta_i - (\delta_j - \tau_s)]}{\sum_{l=0}^{m} \exp \sum_{s=0}^{l} [\beta_i - (\delta_j - \tau_s)]},$$

where $\beta_i$ is the location of person $i$ on a latent continuum, $\delta_j$ the difficulty of item $j$ on the same
continuum, and $\tau_s$ the $s$-th threshold location of the rating scale, which is common to all the
items. This model is usually referred to as the Rasch rating scale model. Later, the Rasch partial
credit model developed by Masters (1982) generalizes the Rasch rating scale model into

$$P(X_{ij} = k) = \frac{\exp \sum_{s=0}^{k} (\beta_i - \tau_{js})}{\sum_{l=0}^{m} \exp \sum_{s=0}^{l} (\beta_i - \tau_{js})},$$

where $\tau_{js}$ is the $s$-th threshold location of item $j$ on a latent continuum. When $\tau_{js}$ can be
decomposed as $\tau_{js} = \delta_j - \tau_s$, these two models coincide. Uebersax and Grove (1993) and Mineiro (2011b)
apply the polytomous form of the Rasch model with minor changes to aggregate ordinal labels from
a crowd.

Probabilistic matrix factorization. Let $X_{ij}$ be the label given by worker $i$ to item $j$. Let $Y_j$
be the true label of item $j$. Whitehill et al. (2009) model the labeling process by revising the Rasch
model into

Pix„ = U) =- 


1 + exp 1^- 

and refer to their model as GLAD (Generative Model of Labels, Abilities, and Difficulties). It is easy
to see that GLAD violates the principle of invariant comparison. By using the per-worker confusion
matrix in the Dawid-Skene model, Mineiro (2011a) generalizes GLAD to multiclass labeling as


P{Xij = k\Yj = c) oc exp 


f3i{c,k) 


Welinder et al. (2010) parameterize workers and items with vectors and suggest

$$P(X_{ij} = Y_j) = \Phi(w_i^\top z_j - b_j),$$

where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution,
$w_i \in \mathbb{R}^d$ the unobserved worker parameter, and $z_j \in \mathbb{R}^d$, $b_j \in \mathbb{R}$ the unobserved
item parameters. GLAD can be roughly thought of as a special case of this model with the dimension
$d = 1$ and $b_j = 0$.

Other related work. Further related models are studied by Bachrach et al. (2012), Tian and Zhu (2012),
Dai et al. (2013), and Venanzi et al. (2014). For online decision making in crowdsourcing, we refer
the readers to (Sheng et al., 2008; Abraham et al., 2013; Chen et al., 2013; Singla and Krause, 2013;
Ho et al., 2014; Anari et al., 2014). Regularized maximum entropy is studied in (Chen and Rosenfeld, 2000;
Lebanon and Lafferty, 2001; Kazama and Tsujii, 2003; Altun and Smola, 2006; Dudik et al., 2007).
Zhu et al. (1997) propose a minimax entropy method for feature binding and selection, apply it to
texture modeling, and obtain a new class of Markov random field models. Shah and Zhou (2014) propose
a multiplicative payment mechanism to incentivize crowdsourcing workers to answer a question when they
are sure and skip when they are not sure. They obtain extremely high quality crowdsourced data by
using their mechanism.


8 Experiments 

In this section, we report empirical results of our method and some existing methods discussed
in Section 7. Two error metrics are considered. One is the classification error rate for binary or
multiclass data, and the other is the mean square error for ordinal data.
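
For concreteness, the two metrics can be written as below. The sketch assumes that labels are encoded as integers, so that the squared difference between ordinal labels is meaningful; the function names are illustrative.

```python
def error_rate(predicted, truth):
    """Fraction of items whose predicted label differs from the true label."""
    return sum(p != t for p, t in zip(predicted, truth)) / len(truth)


def mean_square_error(predicted, truth):
    """Average squared difference between predicted and true ordinal labels."""
    return sum((p - t) ** 2 for p, t in zip(predicted, truth)) / len(truth)
```

The error rate treats all mistakes alike, whereas the mean square error penalizes a prediction more the further it lies from the true ordinal label.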
8.1 Datasets

All datasets that we use are from real crowdsourcing tasks and are publicly available. The details
are as follows:

• Bluebirds (Welinder et al., 2010). This dataset contains a set of 108 images which are
labeled as indigo bunting or blue grosbeak by 39 crowdsourcing workers. Every worker labeled
every image. The average error rate of the workers is 36.44%, compared to the error rate of
random guessing at 50%.

• Price (Liu et al., 2013). This dataset consists of 80 household items collected from stores
such as Amazon and Costco. The prices of the products are estimated by 155 undergraduate
students from UC Irvine. Seven price bins are created in this data collection: $0-$50,
$51-$100, $101-$250, $251-$500, $501-$1000, $1001-$2000, and $2001-$5000. For each
product, a student has to decide which bin its price falls in. The average error rate of the
students is 69.47%, compared to the error rate of random guessing at 85.71%. It may not be
surprising that this dataset is systematically biased: all the students tend to underestimate
the prices of the products.

• RTE (Snow et al., 2008). For each crowdsourced question, the worker is presented with
two sentences and asked to check if the second hypothesis sentence can be inferred from the
first. This dataset contains 800 sentence pairs and 164 workers. Each sentence pair has 10
annotations. The average error rate of the workers is 15.87%, compared to the error rate of
random guessing at 50%.

• Temp (Snow et al., 2008). For each crowdsourced question, the worker is presented with
a pair of verb events and asked to check if the event described by the first verb occurs before
or after the second. This dataset contains 462 event pairs and 76 workers. Each event pair
has 10 annotations. The average error rate of the workers is 16.30%, compared to the error
rate of random guessing at 50%.

• Age (Han et al., 2014). Amazon Mechanical Turk workers are asked to estimate the age of a
person in a face image. This dataset contains 1002 images and 165 workers. Each image has
10 age estimates. Those estimates are integers not more than 100. We put them into 7 bins:
[1, 9], [10, 19], [20, 29], [30, 39], [40, 49], [50, 59], [60, 100]. With respect to this partition, the
average error rate of the workers is 44.64%, compared to the error rate of random guessing
at 85.71%.

• Web search (Zhou et al., 2012). This dataset contains 2665 query-URL pairs and 177
workers. Given a query-URL pair, a worker is required to provide a rating that measures how
relevant the URL is to the query. The rating scale is 5-level: perfect, excellent, good, fair, or bad.
On average, each pair was labeled by around 6 different workers, and each worker labeled
around 90 pairs. More than 10 workers labeled only one query-URL pair. The ground truth
labels used for evaluation are obtained via a consensus among a group of 9 search experts. The
average error rate of the workers is 62.95%, compared to the error rate of random guessing
at 80%.


Some of the datasets can be found at http://research.microsoft.com/en-us/projects/crowd/

              # classes   # items   # workers   # worker labels
Bluebirds         2          108        39            4212
Price             7           80       155           12400
RTE               2          800       164            8000
Temp              2          462        76            4620
Age               7         1002       165           10020
Web search        5         2665       177           15567
Web spam          2          149        18            1901

Table 2: Summary of the real crowdsourcing datasets used in our experiments.



               MV     DS-EM   DS-MF   GLAD    MMCE(M)
Bluebirds     24.07   10.19   10.19   12.04     8.33
Price         67.50   65.00   67.50   68.75    67.50
RTE           10.31    7.25    6.63    7.00     7.50
Temp           6.39    5.84    5.84    5.63     5.63
Age           34.88   39.62   36.33   35.73    31.14
Web search    26.93   16.92   18.24   19.30    11.12
Web spam      19.80   13.42   12.75   18.12    12.75

Table 3: Error rates (in %) of various methods on real datasets.


• Web spam. This dataset is provided by the Microsoft web spam team. It contains 149 web pages
and 18 workers. The workers are required to identify which web pages are spam. On average,
each web page is labeled by around 13 workers. The ground truth labels used for evaluation
are provided by web spam experts. The average error rate of the workers is 16.30%, compared
to the error rate of random guessing at 50%.

Table 2 shows a summary of these datasets.
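
The random-guessing baselines quoted in the dataset descriptions follow from the number of classes, assuming uniform guessing over $K$ classes with error $1 - 1/K$. The sketch below reproduces them and also computes the average number of labels per item from the counts in Table 2; the helper names are illustrative.

```python
def random_guess_error(num_classes):
    """Error rate of guessing uniformly at random among num_classes labels."""
    return 1.0 - 1.0 / num_classes


def average_labels_per_item(num_labels, num_items):
    """Average number of worker labels collected per item."""
    return num_labels / num_items


baseline_7_classes = round(100 * random_guess_error(7), 2)  # Price, Age
baseline_5_classes = round(100 * random_guess_error(5), 2)  # Web search
web_search_redundancy = average_labels_per_item(15567, 2665)
web_spam_redundancy = average_labels_per_item(1901, 149)
```

The redundancy values agree with the approximate figures quoted above (around 6 labels per query-URL pair and around 13 labels per web page).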


8.2 Methods 


We evaluate the following methods in our experiments: 

• Majority voting (MV). It is perhaps the simplest baseline. 


• Dawid-Skene model + EM (DS-EM). Under the generative model of Dawid and Skene (1979),
this method jointly estimates workers' parameters and true labels by maximizing the
likelihood of observed labels with the EM algorithm.

• Dawid-Skene model + mean field (DS-MF). This method performs variational Bayesian
inference using the mean field (MF) algorithm (Liu et al., 2012). It assumes a Dirichlet
prior parameterized by a vector $\alpha_k$ on the $k$-th row of the worker confusion matrix
in the Dawid-Skene model with $\alpha_{k,k} = c_1$, and $\alpha_{k,l} = c_2$ for all $l \neq k$. The
hyperparameters $\{c_1, c_2\}$ are selected by maximizing the marginal likelihood calculated by MF,
and searched in a $10 \times 10$ grid in which the ratio $c_1 / c_2$ and $c_2$ each range over powers of 10.

               MV      DS-EM   DS-MF   LTA     MMCE(M)   MMCE(O)
Price         1.605    1.517   1.487   1.504    1.643     1.466
Age           0.730    0.852   0.739   0.696    0.605     0.794
Web search    0.930    0.539   0.559   0.481    0.419     0.384

Table 4: Mean square errors of various methods on ordinal datasets.


Probability bin      (0, 0.5)   (0.5, 0.6)   (0.6, 0.7)   (0.7, 0.8)   (0.8, 0.9)   (0.9, 1)
# items                 173        291          292          313          406         1178
Error rate             0.416      0.381        0.199        0.080        0.020       0.001
Mean square error      0.832      0.423        0.250        0.118        0.035       0.001

Table 5: Positive correlation between probabilistic labels and errors. The results are from the
regularized ordinal minimax conditional entropy method on the web search dataset.


• GLAD. We use the multiclass version of GLAD proposed by Mineiro (2011a) and also his
open source implementation.

• Latent trait analysis (LTA). It is a variant of the polytomous Rasch model proposed by
Mineiro (2011b) with an open source implementation.

• Regularized minimax conditional entropy for multiclass labels (MMCE(M)). It is
implemented with the Euclidean norm based regularization.

• Regularized minimax conditional entropy for ordinal labels (MMCE(O)). It is
implemented with the Euclidean norm based regularization.

The regularization parameters in MMCE are chosen through the cross-validation procedure described
in Section 6.2.


8.3 Results 

Table 3 shows the error rates of various methods on real crowdsourcing datasets. Our multiclass
minimax conditional entropy method outperforms the compared methods on most datasets. Table 4
shows the mean square errors of various methods on three ordinal datasets. Our ordinal minimax
conditional entropy method performs best on the price and web search datasets but performs poorly
on the age dataset. Table 5 shows the correlation between probabilistic labels and errors for our
ordinal minimax conditional entropy method on the web search dataset. From the results, the labels
estimated with larger probabilities are more likely to be correct. We observed similar behavior
for our multiclass minimax conditional entropy method. We also evaluated our method with the
regularization in Equation (16) and observed that this variant somewhat hurts performance on
most datasets.
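
A breakdown of the kind reported in Table 5 can be sketched as follows: items are grouped by the probability of their estimated label, and the error rate is computed per group. The bin edges mirror the table, while the treatment of values that fall exactly on an edge (assigned to the lower bin) is an assumption.

```python
def error_rate_by_confidence(max_probs, correct,
                             edges=(0.0, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)):
    """Return a list of (item count, error rate or None) per probability bin.

    max_probs: probability assigned to the estimated label of each item
    correct:   booleans, True where the estimated label matches the truth
    """
    num_bins = len(edges) - 1
    counts = [0] * num_bins
    errors = [0] * num_bins
    for p, ok in zip(max_probs, correct):
        # Find the first bin whose upper edge is at least p.
        b = num_bins - 1
        for candidate in range(num_bins):
            if p <= edges[candidate + 1]:
                b = candidate
                break
        counts[b] += 1
        errors[b] += 0 if ok else 1
    return [(counts[b], errors[b] / counts[b] if counts[b] else None)
            for b in range(num_bins)]


summary = error_rate_by_confidence([0.95, 0.55, 0.58, 0.99],
                                   [True, False, True, True])
```

In the toy call, the two low-confidence items share the (0.5, 0.6) bin with one error between them, and the two high-confidence items fall in the (0.9, 1) bin with no errors.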


9 Conclusion 

We have developed a minimax conditional entropy principle for aggregating noisy labels from
crowdsourcing workers. Our formulation involves two probabilistic distributions. One is the distribution
of the true labels of the items, and the other is the distribution under which the workers generate
their labels for the items. Both distributions are unknown. We jointly infer them by first
maximizing the entropy of the observed labels of the workers conditioned on the true labels of the items
over the distribution of generating workers' labels, and then minimizing the maximum entropy over
the distribution of the true labels of the items. Empirical results on real crowdsourcing datasets
validate our approach.

Our code is available at http://research.microsoft.com/en-us/projects/crowd/

We have considered aggregating multiclass and ordinal labels via minimax conditional entropy.
The framework is general and should be extensible to many other labeling tasks in which the labels
are structured in different ways, such as protein folding (Khatib et al., 2011), machine translation
(Zaidan and Callison-Burch, 2011), hierarchical classification (Koller and Sahami, 1997), and
speech captioning (Murphy et al., 2013). To achieve the extension, the constraints for workers
and items need to be customized to each domain, and this probably results in differently
structured confusion matrices.

A Dual Form of Minimax Conditional Entropy


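To derive the dual of minimax conditional entropy, we substitute the probabilistic model in
Equation (6) into the Lagrangian (5) and obtain

$$
\begin{aligned}
L ={}& -\sum_{i,j,c} Q(Y_j = c) \sum_k P(X_{ij} = k \mid Y_j = c) \log \Big\{ \frac{1}{Z_{ij}} \exp[\sigma_i(c,k) + \tau_j(c,k)] \Big\} \\
& + \sum_{i,c,k} \sigma_i(c,k) \sum_j Q(Y_j = c) \big[ P(X_{ij} = k \mid Y_j = c) - I(x_{ij} = k) \big] \\
& + \sum_{j,c,k} \tau_j(c,k) \sum_i Q(Y_j = c) \big[ P(X_{ij} = k \mid Y_j = c) - I(x_{ij} = k) \big] \\
& + \sum_{i,j,c} \lambda_{ijc} \Big[ \sum_k P(X_{ij} = k \mid Y_j = c) - 1 \Big] \\
={}& -\sum_{i,j,c} Q(Y_j = c) \sum_k P(X_{ij} = k \mid Y_j = c) \big[ \sigma_i(c,k) + \tau_j(c,k) \big] + \sum_{i,j,c} Q(Y_j = c) \log Z_{ij} \\
& + \sum_{i,c,k} \sigma_i(c,k) \sum_j Q(Y_j = c) \big[ P(X_{ij} = k \mid Y_j = c) - I(x_{ij} = k) \big] \\
& + \sum_{j,c,k} \tau_j(c,k) \sum_i Q(Y_j = c) \big[ P(X_{ij} = k \mid Y_j = c) - I(x_{ij} = k) \big] \\
={}& -\sum_{i,j,c} Q(Y_j = c) \Big\{ \sum_k I(x_{ij} = k) \big[ \sigma_i(c,k) + \tau_j(c,k) \big] - \log Z_{ij} \Big\} \\
={}& -\sum_{i,j,c} Q(Y_j = c) \log P(X_{ij} = x_{ij} \mid Y_j = c).
\end{aligned}
$$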




B Proof of Theorem 2.1

Let us first check $\sum_{X,Y} Q(X,Y) \log Q(X,Y)$. By definition,

$$\sum_X Q(X) \log Q(X) = 0, \qquad Q(X \mid Y) = Q(X).$$

Hence, we have

$$
\begin{aligned}
\sum_{X,Y} Q(X,Y) \log Q(X,Y) &= \sum_{X,Y} [Q(X \mid Y) Q(Y)] \log [Q(X \mid Y) Q(Y)] \\
&= \sum_{X,Y} [Q(X) Q(Y)] \log [Q(X) Q(Y)] \\
&= \sum_X Q(X) \log Q(X) + \sum_Y Q(Y) \log Q(Y) \\
&= \sum_Y Q(Y) \log Q(Y).
\end{aligned}
$$

Next we check $\sum_{X,Y} Q(X,Y) \log P(X,Y)$. Write

$$\sum_{X,Y} Q(X,Y) \log P(X,Y) = \sum_{X,Y} Q(X,Y) \log P(X \mid Y) + \sum_{X,Y} Q(X,Y) \log P(Y).$$

Since $P$ is a uniform distribution over $Y$, $P(Y)$ is a constant. Thus,

$$\sum_{X,Y} Q(X,Y) \log P(Y) = \log P(Y) \sum_{X,Y} Q(X,Y) = \log P(Y),$$

which is still a constant. By Equation (6), we have

$$
\begin{aligned}
\sum_{X,Y} Q(X,Y) \log P(X \mid Y) &= \sum_{i,j,c,k} Q(X_{ij} = k, Y_j = c) \log P(X_{ij} = k \mid Y_j = c) \\
&= \sum_{i,j,c,k} Q(X_{ij} = k, Y_j = c) \log \Big\{ \frac{1}{Z_{ij}} \exp[\sigma_i(c,k) + \tau_j(c,k)] \Big\} \\
&= \sum_{i,j,c,k} Q(X_{ij} = k, Y_j = c) \big[ \sigma_i(c,k) + \tau_j(c,k) - \log Z_{ij} \big].
\end{aligned}
$$

By Equation (2a), we have

$$
\begin{aligned}
\sum_{i,j,c,k} Q(X_{ij} = k, Y_j = c)\, \sigma_i(c,k) &= \sum_{i,c,k} \sigma_i(c,k) \sum_j I(x_{ij} = k)\, Q(Y_j = c) \\
&= \sum_{i,c,k} \sigma_i(c,k) \sum_j P(X_{ij} = k \mid Y_j = c)\, Q(Y_j = c).
\end{aligned}
$$

Similarly, by Equation (2b),

$$\sum_{i,j,c,k} Q(X_{ij} = k, Y_j = c)\, \tau_j(c,k) = \sum_{j,c,k} \tau_j(c,k) \sum_i P(X_{ij} = k \mid Y_j = c)\, Q(Y_j = c).$$

In addition, since $Z_{ij}$ does not depend on $k$,

$$
\begin{aligned}
\sum_{i,j,c,k} Q(X_{ij} = k, Y_j = c) \log Z_{ij} &= \sum_{i,j} \sum_c \log Z_{ij} \sum_k Q(X_{ij} = k, Y_j = c) \\
&= \sum_{i,j} \sum_c Q(Y_j = c) \log Z_{ij} \\
&= \sum_{i,j} \sum_c Q(Y_j = c) \sum_k P(X_{ij} = k \mid Y_j = c) \log Z_{ij}.
\end{aligned}
$$

Putting all the pieces together, we have

$$
\begin{aligned}
D_{KL}(Q \,\|\, P) ={}& -\sum_{j,c} Q(Y_j = c) \sum_{i,k} P(X_{ij} = k \mid Y_j = c) \big[ \sigma_i(c,k) + \tau_j(c,k) - \log Z_{ij} \big] \\
& + \sum_Y Q(Y) \log Q(Y) - \log P(Y) \\
={}& -\sum_{j,c} Q(Y_j = c) \sum_{i,k} P(X_{ij} = k \mid Y_j = c) \log P(X_{ij} = k \mid Y_j = c) \\
& + \sum_Y Q(Y) \log Q(Y) - \log P(Y).
\end{aligned}
$$

Note that, when the true labels are deterministic,

$$\sum_Y Q(Y) \log Q(Y) = 0.$$

So,

$$D_{KL}(Q \,\|\, P) = -\sum_{j,c} Q(Y_j = c) \sum_{i,k} P(X_{ij} = k \mid Y_j = c) \log P(X_{ij} = k \mid Y_j = c) - \log P(Y).$$

This concludes the proof.


C Dual Form of Regularized Minimax Conditional Entropy 


We derive the dual problem of regularized minimax conditional entropy with the sum-to-zero
constraints in Equation (15). The dual derivation without the additional constraints can be obtained
by a similar procedure. Let us write the Lagrangian as

$$L = H(X \mid Y) - H(Y) - \frac{1}{\alpha} \Omega(\xi) - \frac{1}{\beta} \Psi(\zeta) + L_\sigma + L_\tau + L_\lambda + L_\mu, \tag{26}$$

in which


La 

Lr 

Lx 

L, 


'^ai{c,k) ii{c,k) - ^ (^(j)ij{c,k) - ^ij{c,k)^ 
i,c^k ^ j 

^ Tj{c, k) Cj(c, k) k) - ^ijic, kfj 


j,c,k 


Y^Xijc Y.P{Xij = k\Y,=c)-l 

i,j,c ^ k,c 


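By the KKT conditions, maximizing $L$ with respect to $P$ results in

$$\frac{\partial L}{\partial P(X_{ij} = k \mid Y_j = c)} = -\log P(X_{ij} = k \mid Y_j = c) - 1 + \lambda_{ijc} + \sigma_i(c,k) + \tau_j(c,k) = 0.$$

As shown in Section 2, this leads to the probabilistic labeling model. Similarly, maximizing
$L$ with respect to $\xi$ results in

$$\frac{\partial L}{\partial \xi_i(c,k)} = \sigma_i(c,k) - \frac{1}{\alpha} \xi_i(c,k) = 0, \quad \forall c,\ k \neq c,$$

$$\frac{\partial L}{\partial \xi_i(c,c)} = \sigma_i(c,c) - \frac{1}{\alpha} \xi_i(c,c) + \mu_i = 0, \quad \forall c.$$

So we have

$$\xi_i(c,k) = \alpha\, \sigma_i(c,k), \quad \forall c,\ k \neq c, \tag{27}$$

$$\xi_i(c,c) = \alpha\, [\sigma_i(c,c) + \mu_i], \quad \forall c. \tag{28}$$

Moreover, maximizing $L$ with respect to $\zeta$ results in

$$\frac{\partial L}{\partial \zeta_j(c,k)} = \tau_j(c,k) - \frac{1}{\beta} \zeta_j(c,k) = 0, \quad \forall c, k.$$

Hence,

$$\zeta_j(c,k) = \beta\, \tau_j(c,k), \quad \forall c, k.$$

Substituting these results and the probabilistic model in Equation (6) into the Lagrangian (26), we have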

L = - Y^Q{Yj = c) log exp[cri(c, Xij) + Tj{c, Xij)]| - H{Y) 


a 

2 


i,c '' k^c 




i,c,k 


+ “X] I + ai{c,c)[ai{c,c) +/r*] \ + ^ ^[r,-(c. A:)] 

i,c ^ k^c ^ i,c,k 

CX E c)[cJi(c,c) + Hi] 

i,c 

Y^Q{Yj = c)log |^exp[cri(c,/i;) +rj(c, fe)]| -H{Y) 


*J,C 

a 


i,c ^ k^c 

By minimizing the Lagrangian over $\mu_i$, we obtain


+ [^ii<^^c)+Hi?\ + ^Yl^Tjic,k)f. 


i,c,k 


fli — (Ti (c, c). 


So, the dual problem can be expressed as 


min - Q{Yj = c) log | ^ exp[o-j(c, xtj) + t,(c, x^)] I - iL(y) 


*J,C 

a 


+ I + cri(c,c) - CJi(c,C) 

2,C ^ /c/c 


i.c.k 


( 29 ) 


Let us replace $\sigma_i(c,k)$ with $\sigma_i(c,k) + \nu_i$. It is easy to verify that this dual problem can be equivalently
written as

min - V Q{Yj = c) log j ^ exp[cri(c, Xij) + Tj{c, Xij)] \ - H{Y) 

Q-cr.r,!/ Zij J 


«J,C 

a 


i,c k^c 


+ o-i(c,c) - cJi(c,c) I + §y][r,(c,/c)]2. 


i,c^k 
Minimizing the objective function over $\nu$ leads to

min - V Q{Yj = c) log | ^ exp[cri(c, Xij) + Tj{c, Xij)] \ -H{Y) 

I A? J 

l^J,C J >' 

+ - cri{c,k) + (T* (c, c) - fjj (c, c) \ + ^'^[Tj{c,k)f. 

i,c '' k^c ^ i,c,k 

D Coordinate Algorithm 

To solve the dual problem

$$\max_{Q, \sigma, \tau} \; \sum_{j,c} Q(Y_j = c) \sum_i \log P(X_{ij} = x_{ij} \mid Y_j = c) + H(Y) - \alpha\, \Omega^*(\sigma) - \beta\, \Psi^*(\tau)$$

subject to the probability constraints

$$\sum_c Q(Y_j = c) = 1, \ \forall j, \qquad Q(Y_j = c) \geq 0, \ \forall j, c,$$

we first split the variables into two groups and then alternately update them. One group contains
the parameters of workers and items in $P(X_{ij} = x_{ij} \mid Y_j = c)$, that is, $\{\sigma_i(c,k), \tau_j(c,k), \forall i, j, c, k\}$,
and the other group contains the unknown true labels $\{Q(Y_j = k), \forall j, k\}$. When we update the
variables in the first group, the variables in the second group take their current values. Then, the
optimization problem becomes

$$\max_{\sigma, \tau} \; \sum_{j,c} Q(Y_j = c) \sum_i \log P(X_{ij} = x_{ij} \mid Y_j = c) - \alpha\, \Omega^*(\sigma) - \beta\, \Psi^*(\tau).$$

Conversely, when we update the variables in the second group, the variables in the first group take
their current values. We thus have the optimization problem

$$\max_{Q} \; \sum_{j,c} Q(Y_j = c) \sum_i \log P(X_{ij} = x_{ij} \mid Y_j = c) + H(Y)$$

subject to the above probability constraints. This constrained optimization problem can be solved
with the Lagrangian

$$L = \sum_{j,c} Q(Y_j = c) \sum_i \log P(X_{ij} = x_{ij} \mid Y_j = c) + H(Y) - \sum_j \lambda_j \Big[ \sum_c Q(Y_j = c) - 1 \Big],$$

where the $\lambda_j$'s are the Lagrange multipliers. By the KKT conditions,

$$\frac{\partial L}{\partial Q(Y_j = c)} = \sum_i \log P(X_{ij} = x_{ij} \mid Y_j = c) - \log Q(Y_j = c) - 1 - \lambda_j = 0.$$

This implies

$$Q(Y_j = c) \propto \prod_i P(X_{ij} = x_{ij} \mid Y_j = c).$$